    Single Channel ECG for Obstructive Sleep Apnea Severity Detection using a Deep Learning Approach

    Obstructive sleep apnea (OSA) is a common sleep disorder characterized by abnormal breathing during sleep. Severe OSA is associated with serious outcomes, including sudden cardiac death (SCD). Polysomnography (PSG) is the gold standard for OSA diagnosis. It records many signals from the patient's body for at least one whole night and yields the Apnea-Hypopnea Index (AHI), the number of apnea or hypopnea incidences per hour, which is then used to classify patients into OSA severity levels. However, PSG has many disadvantages and limitations. Consequently, we propose a novel methodology for OSA severity classification using a Deep Learning approach. We focus on the classification between normal subjects (AHI < 5) and severe OSA patients (AHI > 30). The 15-second raw ECG records with apnea or hypopnea events were used with a series of deep learning models. The main advantages of our proposed method include easier data acquisition, instantaneous OSA severity detection, and effective feature extraction without requiring domain expertise. To evaluate our proposed method, 545 subjects from the MrOS sleep study (Visit 1) database, of which 364 were normal and 181 were severe OSA patients, were used with the k-fold cross-validation technique. The method achieved an accuracy of 79.45% for OSA severity classification, together with the corresponding sensitivity, specificity, and F-score. This is significantly higher than the results from an SVM classifier using RR intervals and ECG-derived respiration (EDR) features. This promising result shows that the proposed method is a good starting point for detecting OSA severity from a single-channel ECG, which can be obtained from wearable devices at home, and can also be applied to near-real-time alerting systems, for example before SCD occurs.
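
    As a concrete illustration of the index the abstract relies on, the following minimal Python sketch (not the paper's code; the function names and example numbers are hypothetical) computes the AHI and maps it to the standard clinical severity cutoffs, of which this study uses only the normal (AHI < 5) and severe (AHI > 30) extremes:

        # Minimal sketch: AHI = apnea/hypopnea events per hour of sleep,
        # mapped to the standard clinical severity cutoffs (5, 15, 30).

        def apnea_hypopnea_index(num_events: int, recording_hours: float) -> float:
            # Number of apnea or hypopnea events per hour of recording.
            return num_events / recording_hours

        def osa_severity(ahi: float) -> str:
            if ahi < 5:
                return "normal"
            elif ahi < 15:
                return "mild"
            elif ahi < 30:
                return "moderate"
            return "severe"

        # Hypothetical example: 124 events over 8 hours -> AHI 15.5 -> "moderate"
        print(osa_severity(apnea_hypopnea_index(124, 8.0)))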

    E-Stream: Evolution-Based Technique for Stream Clustering

    Data streams have recently attracted attention for their applicability to numerous domains including credit fraud detection, network intrusion detection, and click streams. Stream clustering is a technique that performs cluster analysis of data streams and can monitor the results in real time. A data stream is a continuously generated sequence of data whose characteristics evolve over time. A good stream clustering algorithm should recognize such evolution and yield a cluster model that conforms to the current data. In this paper, we propose a new technique for stream clustering that supports five kinds of evolution: appearance, disappearance, self-evolution, merge, and split.
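
    To make the evolutions concrete, here is an illustrative skeleton (explicitly not the E-Stream implementation; the exponential fading function and all thresholds are assumptions) showing how a faded-weight cluster model can support appearance, disappearance, and self-evolution on one-dimensional data:

        # Illustrative skeleton only: faded cluster weights and three of the
        # five evolutions; merge and split checks are noted but omitted.

        class FadedCluster:
            def __init__(self, center, weight=1.0, last_update=0.0):
                self.center, self.weight, self.last_update = center, weight, last_update

            def fade(self, now, decay=0.01):
                # Older clusters lose influence exponentially (assumed form).
                self.weight *= 2.0 ** (-decay * (now - self.last_update))
                self.last_update = now

        def step(clusters, point, now, min_weight=0.1, new_cluster_dist=1.0):
            for c in clusters:
                c.fade(now)
            # Disappearance: clusters whose weight has faded away are dropped.
            clusters = [c for c in clusters if c.weight >= min_weight]
            nearest = min(clusters, key=lambda c: abs(c.center - point), default=None)
            if nearest is None or abs(nearest.center - point) > new_cluster_dist:
                clusters.append(FadedCluster(point, last_update=now))  # appearance
            else:
                # Self-evolution: the nearest cluster absorbs the new point.
                total = nearest.weight + 1.0
                nearest.center = (nearest.center * nearest.weight + point) / total
                nearest.weight = total
            # Merge (clusters drifting close together) and split (a cluster whose
            # internal distribution becomes bimodal) would be checked here.
            return clusters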

    Native language identification of fluent and advanced non-native writers

    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202. The accepted version of the publication may differ from the final published version.

    Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) are not associated with any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.

    Research funded by the Higher Education Commission and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (#MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175).
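
    The classification step described above can be sketched as follows (a simplification under stated assumptions: each text sample is reduced to a single stylistic feature vector rather than the paper's point-set representation, and all names and values are illustrative):

        # Hedged sketch: retrieve the top-k stylistically similar text samples
        # (SSTs) and combine their votes with inverse-distance weighting, a
        # common form of probabilistic k-NN.
        from collections import Counter
        import math

        def top_k_ssts(query, corpus, k):
            # corpus: list of (feature_vector, native_language) pairs
            return sorted(corpus, key=lambda s: math.dist(query, s[0]))[:k]

        def predict_native_language(query, corpus, k=5):
            votes = Counter()
            for vec, lang in top_k_ssts(query, corpus, k):
                votes[lang] += 1.0 / (1e-9 + math.dist(query, vec))
            total = sum(votes.values())
            return {lang: w / total for lang, w in votes.items()}

        # Toy 2-D example with made-up feature values:
        corpus = [([0.1, 0.9], "French"), ([0.2, 0.8], "French"), ([0.9, 0.1], "German")]
        print(predict_native_language([0.15, 0.85], corpus, k=2))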

    Efficient Algorithms for High Dimensional Data Mining

    Data mining and knowledge discovery have attracted research interest in the last decade. The size and complexity of real-world data are dramatically increasing, and although new efficient algorithms to deal with such data are constantly being proposed, the mining of high dimensional data still presents a challenge. In this dissertation, several novel algorithms are proposed to handle such datasets. These algorithms are applied to domains as diverse as electrocardiography (ECG), electroencephalography (EEG), human DNA sequencing, protein sequencing, stock market data, gesture recognition data, motion capture data, accelerometer data, audio data, image data, and handwritten manuscripts, among others. This dissertation contributes to the data mining community in three ways.

    Firstly, we propose a novel algorithm for searching for the nearest neighbor in time series data using multi-level lower-bounding techniques and other speed-up techniques. The proposed algorithm, called UCRSuite, is faster than the previous state-of-the-art by several orders of magnitude. Because similarity search is a primitive operation and a bottleneck in complex data mining algorithms, this contribution is likely to make a significant impact.

    Secondly, we propose two approximation algorithms to handle high dimensional data. A fast shapelet discovery algorithm, called FastShapelet, discovers approximate shapelets that are as accurate as those found by an exact search. In addition, we present an unsupervised algorithm, called DocMotif, which can discover similar figures in given manuscripts. The proposed algorithms are faster than the best known algorithms by two or three orders of magnitude, and the discovered results are not measurably different from those of the exact algorithms. Moreover, in the second work, a detailed mathematical analysis bounding the error is provided.

    Finally, we show that in order to create a useful clustering of a single time series, an algorithm must have the freedom to ignore some data. We propose a Minimum Description Length (MDL) based time series clustering algorithm that has this ability. Our results demonstrate that the proposed algorithm is not only parameter-free but also efficient and effective for time series clustering.
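
    The first contribution rests on cheap lower bounds that prune expensive DTW computations. The sketch below shows LB_Keogh, one of the well-known bounds in this family (this is the textbook formulation, not the dissertation's code; a matching DTW implementation is sketched further down, after the anytime-clustering abstract):

        # LB_Keogh: an O(n) lower bound on the (squared) DTW distance under a
        # Sakoe-Chiba warping window r. If even the bound exceeds the
        # best-so-far distance, the candidate is discarded without running DTW.

        def lb_keogh(query, candidate, r):
            total = 0.0
            for i, q in enumerate(query):
                window = candidate[max(0, i - r): i + r + 1]
                lo, hi = min(window), max(window)
                if q > hi:
                    total += (q - hi) ** 2
                elif q < lo:
                    total += (q - lo) ** 2
            return total

        def nn_search(query, candidates, r, dtw):
            # dtw: a full DTW distance function, e.g. the one sketched later.
            best, best_dist = None, float("inf")
            for c in candidates:
                if lb_keogh(query, c, r) < best_dist:   # cheap test first
                    d = dtw(query, c, r)                # expensive fallback
                    if d < best_dist:
                        best, best_dist = c, d
            return best, best_dist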

    Fast Shapelets: A Scalable Algorithm for Discovering Time Series Shapelets

    Time series shapelets are a recent promising concept in time series data mining. Shapelets are time series snippets that can be used to classify unlabeled time series. Shapelets not only provide interpretable results, which are useful for domain experts and developers alike, but shapelet-based classifiers have been shown by several independent research groups to have superior accuracy on many datasets. Moreover, shapelets can be seen as generalizing the lazy nearest neighbor classifier to an eager classifier. Thus, as a deployed classification tool, shapelets can be many orders of magnitude faster than any rival with comparable accuracy. Although shapelets are a useful concept, the current literature bemoans the fact that shapelet discovery is a time-consuming task. In spite of several efforts to speed up shapelet discovery algorithms, including the use of specialist hardware, the current state-of-the-art algorithms are still intractable on large datasets. In this work, we propose a fast shapelet discovery algorithm that outperforms the current state-of-the-art by two or three orders of magnitude, while producing models with accuracy that is not perceptibly different.
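
    For readers unfamiliar with shapelets, the following brief sketch (illustrative names; per-subsequence z-normalization, used in practice, is omitted) shows how a discovered shapelet and its learned distance threshold classify an unlabeled series:

        import math

        def shapelet_distance(series, shapelet):
            # Minimum Euclidean distance between the shapelet and any
            # subsequence of the series (brute-force sliding window).
            m = len(shapelet)
            return min(
                math.sqrt(sum((series[i + j] - shapelet[j]) ** 2 for j in range(m)))
                for i in range(len(series) - m + 1)
            )

        def classify(series, shapelet, threshold):
            # A shapelet decision-tree node asks: does the shapelet appear
            # (within the threshold distance) somewhere in the series?
            return "class_A" if shapelet_distance(series, shapelet) <= threshold else "class_B"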

    Mining Historical Documents for Near-Duplicate Figures

    The increasing interest in archiving all of humankind's cultural artifacts has resulted in the digitization of millions of books, and soon a significant fraction of the world's books will be online. Most of the data in historical manuscripts is text, but a significant fraction is devoted to images. This fact has driven much of the recent increase in interest in query-by-content systems for images. While querying/indexing systems can undoubtedly be useful, we believe that the historical manuscript domain is finally ripe for true unsupervised discovery of patterns and regularities. To this end, we introduce an efficient and scalable system which can detect approximately repeated occurrences of shape patterns both within and between historical texts. We show that this ability to find repeated shapes allows automatic annotation of manuscripts and allows users to trace the evolution of ideas. We demonstrate our ideas on datasets of scientific and cultural manuscripts dating back to the fourteenth century.

    [Figure discussion, partially recovered: the examples shown are typical of the perhaps hundreds of books on Diatoms published during the Victorian era [17][22][25]. Thanks to efforts by digital archivists, hundreds of these works, representing over one million individual shapes, have been digitized and placed online. Some are scholarly classics, such as W. & G.S. West's A Monograph of the British Desmidiaceae [25], which is still referenced in modern scientific texts, and some are vanity publications by "gentlemen scholars".]

    Keywords: cultural artifacts; duplication detection; repeated patterns

    A Novel Approximation to Dynamic Time Warping allows Anytime Clustering of Massive Time Series Datasets

    Given the ubiquity of time series data, the data mining community has spent significant time investigating the best time series similarity measure to use for various tasks and domains. After more than a decade of extensive efforts, there is increasing evidence that Dynamic Time Warping (DTW) is very difficult to beat. Given that, recent efforts have focused on making the intrinsically slow DTW algorithm faster. For the similarity-search task, an important subroutine in many data mining algorithms, significant progress has been made by replacing the vast majority of expensive DTW calculations with cheap-to-compute lower bound calculations. However, these lower bound based optimizations do not directly apply to clustering, and thus for some realistic problems, clustering with DTW can take days or weeks. In this work, we show that we can mitigate this untenable lethargy by casting DTW clustering as an anytime algorithm. At the heart of our algorithm is a novel data-adaptive approximation to DTW which can be quickly computed, and which produces approximations to DTW that are much better than the best currently known linear-time approximations. We demonstrate our ideas on real world problems, showing that we can get virtually all the accuracy of a batch DTW clustering algorithm in a fraction of the time.
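
    For reference, here is the textbook dynamic-programming DTW with a Sakoe-Chiba warping window (this is the exact O(n*r) algorithm whose cost motivates the paper, not the paper's data-adaptive anytime approximation):

        def dtw(a, b, r):
            # Squared-distance DTW between sequences a and b, with warping
            # paths constrained to a band of width r around the diagonal.
            n, m = len(a), len(b)
            INF = float("inf")
            D = [[INF] * (m + 1) for _ in range(n + 1)]
            D[0][0] = 0.0
            for i in range(1, n + 1):
                for j in range(max(1, i - r), min(m, i + r) + 1):
                    cost = (a[i - 1] - b[j - 1]) ** 2
                    D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            return D[n][m]

        print(dtw([0, 1, 2, 3], [0, 1, 2, 3], r=1))   # identical series -> 0.0
        print(dtw([0, 1, 2, 3], [0, 0, 1, 2], r=1))   # shifted copy -> small value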

    MDL-Based Time Series Clustering

    Time series data is pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. In this work, we show that the Minimum Description Length (MDL) framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analyses.
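
    A toy illustration of the MDL intuition (the discretization to integers and the entropy coder are standard MDL machinery, but these specific choices, and the omission of the bits needed to encode the center itself, are simplifications): a series is well clustered if describing it as "cluster center plus residual" takes fewer bits than describing it raw.

        from collections import Counter
        import math

        def entropy_bits(symbols):
            # Bits needed to entropy-code a discrete sequence.
            counts, n = Counter(symbols), len(symbols)
            return -sum(c * math.log2(c / n) for c in counts.values())

        def description_length(series, center=None):
            if center is None:
                return entropy_bits(series)            # encode raw values
            residual = [s - c for s, c in zip(series, center)]
            return entropy_bits(residual)              # encode only differences

        series = [5, 6, 5, 6, 5, 6]
        center = [5, 6, 5, 6, 5, 5]
        # The hypothesis (center) shrinks the description: ~6.0 vs ~3.9 bits.
        print(description_length(series), description_length(series, center))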

    Image Mining of Historical Manuscripts to Establish Provenance

    The recent digitization of more than twenty million books has been led by initiatives from countries wishing to preserve their cultural heritage and by commercial endeavors, such as the Google Print Library Project. Within a few years a significant fraction of the world's books will be online. For millions of intact books and tens of millions of loose pages, the provenance of the manuscripts may be in doubt or completely unknown, thus denying historians an understanding of the context of the content. In some cases it may be possible for human experts to regain the provenance by examining linguistic, cultural, and/or stylistic clues. However, such experts are rare, and this investigation is clearly a time-consuming process. One technique used by experts to establish provenance is the examination of the ornate initial letters appearing in the questioned manuscript. By comparing the initial letters in the manuscript to annotated initial letters whose origin is known, the provenance can be determined. In this work we show for the first time that we can reproduce this ability with a computer algorithm. We leverage a recently introduced technique to measure texture similarity and show that it can recognize initial letters with an accuracy that rivals or exceeds human performance. A brute-force implementation of this measure would require several years to process a single large book; however, we introduce a novel lower bound that allows us to process the books in minutes.
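
    The texture measure in question is compression-based. As a stand-in from the same family (explicitly not the paper's measure, and with toy byte-string data), here is the Normalized Compression Distance computed with zlib; the paper's contribution of a lower bound that makes such measures tractable at book scale is not reproduced here:

        import zlib

        def ncd(x: bytes, y: bytes) -> float:
            # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)):
            # similar inputs compress well together, giving a small distance.
            cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
            cxy = len(zlib.compress(x + y))
            return (cxy - min(cx, cy)) / max(cx, cy)

        # Toy stand-ins for rasterized initial letters:
        a = b"ornate-initial-P" * 50
        b = b"ornate-initial-P" * 49 + b"ornate-initial-B"
        c = b"completely-different-texture!" * 30
        print(ncd(a, b) < ncd(a, c))   # True: a and b share structure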